Abstract

We provide a unified treatment of a broad class of noisy structure recovery problems, known as structured
normal means problems. In this setting, the goal is to identify, from a finite collection of Gaussian
distributions with different means, the distribution that produced some observed data. Recent work has 
studied several special cases including sparse vectors, biclusters, and graph-based structures. We establish 
nearly matching upper and lower bounds on the minimax probability of error for any structured normal 
means problem, and we derive an optimality certificate for the maximum likelihood estimator, which can 
be applied to many instantiations. We also consider an experimental design setting, where we generalize 
our minimax bounds and derive an algorithm for computing a design strategy with a certain optimality 
property. We show that our results give tight minimax bounds for many structure recovery problems and 
consider some consequences for interactive sampling. 


1 Introduction 

The prevalence of high-dimensional signals in modern scientific investigation has inspired an influx of
research on recovering structural information from noisy data. These problems arise across a variety of
scientific and engineering disciplines, for example identifying cluster structure in communication or social
networks, multiple hypothesis testing in genomics, or anomaly detection in sensor networking. Specific
structural assumptions include sparsity, low-rankedness, cluster structure, and many others.

The literature in this direction focuses on three inference goals: detection, localization or recovery, and
estimation or denoising. Detection tasks involve deciding whether an observation contains some meaningful
information or is simply ambient noise, while recovery and estimation tasks involve more precisely
characterizing the information contained in a signal. These problems are closely related but also exhibit
important differences. This paper focuses on the recovery problem, where the goal is to identify, from a
finite collection of signals, which signal produced the observed data.

One frustration among researchers is that algorithmic and analytic techniques for these problems differ
significantly for different structural assumptions. This issue was recently resolved in the context of
estimation, where the atomic norm has provided a unifying algorithmic and analytical framework, but no
such theory is available for detection and recovery problems. In this paper, we provide a unification for the
recovery problem, leading to deeper understanding of how signal structure affects statistical performance.

Modern measurement technology also often provides flexibility in designing strategies for data
acquisition, and this adds an element of complexity to inference tasks. Data acquisition by both interactive and
non-interactive experimental design is the typical situation in domains ranging from network tomography to
crowdsourcing, but the statistical implications of these techniques are not fully understood. This paper also
considers the experimental design setting and provides a generic solution and analysis of non-interactive
experimental design for structure recovery problems.

To concretely describe our main contributions, we now develop the decision-theoretic framework of this
paper. We study the structured normal means problem defined by a finite collection of vectors
$\mathcal{V} = \{v_j\}_{j=1}^M \subset \mathbb{R}^d$ that index a family of probability distributions
$\mathbb{P}_j = \mathcal{N}(v_j, I_d)$. An estimator $T$ for the family $\mathcal{V}$ is a measurable function
from $\mathbb{R}^d$ to $[M]$, and its maximum risk is:

$$\mathcal{R}(T, \mathcal{V}) = \sup_{j \in [M]} \mathcal{R}_j(T, \mathcal{V}), \qquad \mathcal{R}_j(T, \mathcal{V}) = \mathbb{P}_j[T(y) \ne j],$$

where we always use $y \sim \mathbb{P}_j$ to be the observation. We are interested in the minimax risk:

$$\mathcal{R}(\mathcal{V}) = \inf_T \mathcal{R}(T, \mathcal{V}) = \inf_T \sup_{j \in [M]} \mathbb{P}_j[T(y) \ne j]. \tag{1}$$

We call this the isotropic setting because each Gaussian has spherical covariance. We are specifically
interested in understanding how the family $\mathcal{V}$ influences the minimax risk. This setting encompasses recent work
on sparsity recovery, biclustering, and many graph-based problems. An important example
to keep in mind is the $k$-sets problem, where the collection $\mathcal{V}$ is formed by vectors $\mu 1_S$ for subsets $S \subset [d]$
of size $k$ and some signal strength parameter $\mu$. Instantiation of our results to this example will determine
the critical scaling of $\mu$ in terms of $d$ and $k$ that is necessary and sufficient for achieving asymptotically zero
minimax risk.

In the experimental design setting, the statistician specifies a sensing strategy, defined by a vector
$B \in \mathbb{R}_+^d$. Using this strategy, under $\mathbb{P}_j$, the observation is, for each $i \in [d]$:

$$y(i) \sim v_j(i) + B(i)^{-1/2} \mathcal{N}(0, 1) = \mathcal{N}(v_j(i), B(i)^{-1}). \tag{2}$$

If $B(i) = 0$, then we say that $y(i) = 0$ almost surely. We call this distribution $\mathbb{P}_{j,B}$, to denote the dependence
both on the target signal $v_j$ and the sensing strategy $B$. The total measurement effort, or budget, used by
the strategy is $\|B\|_1$, and we are interested in signal recovery under a budget constraint. Specifically, the
minimax risk in this setting is:

$$\mathcal{R}(\mathcal{V}, \tau) = \inf_{T,\, B : \|B\|_1 \le \tau} \sup_{j \in [M]} \mathbb{P}_{j,B}[T(y) \ne j]. \tag{3}$$


With this background, we now state our main contributions: 

1. We give nearly matching upper and lower bounds on the minimax risk for both isotropic and experimental
design settings (Theorems 2 and 5). This result matches many special cases that we are aware
of. Moreover, in examples with an asymptotic flavor, including the $k$-sets example, this shows
that the maximum likelihood estimator (MLE) achieves the minimax rate.

2. In the isotropic case, we derive a condition on the family $\mathcal{V}$ under which the MLE exactly achieves
the minimax risk, thereby certifying optimality of this estimator.

3. We give sufficient conditions that certify optimality of an experimental design strategy and also give
an algorithm for computing such a strategy prior to data acquisition.

4. Lastly, we provide many examples to demonstrate the generality and applicability of our results.

We adopt the following notation: For a natural number $M$, we use $[M]$ to denote the set $\{1, \ldots, M\}$.
For a sequence of problems indexed by a natural number $n \in \mathbb{N}$ and a signal strength parameter $\mu$, we
often state results in terms of the minimax rate and use the notation $\mu \asymp f(n)$ to denote this asymptotic
scaling. This notation means that if $\mu = \omega(1) f(n)$, then the minimax risk can be driven to zero, while if
$\mu = o(1) f(n)$, then the minimax risk approaches one asymptotically. Finally, for vectors $v, M \in \mathbb{R}^d$, we
use $\|v\|_M = \sqrt{v^T \mathrm{diag}(M) v}$ to denote the Mahalanobis norm.

2 Related Work 

The structured normal means problem has a rich history in statistics and recent attention has focused on
combinatorial structures. This line is motivated by statistical applications involving complex data sources,
such as tasks in graph-structured signal processing [22]. Focusing on detection problems, a number of
papers study various combinatorial structures, including $k$-sets, cliques, paths, and clusters
in graphs. While there are comprehensive results for many of these examples, a unifying theory for detection
problems is still undeveloped.

Turning to recovery or localization, again several specific examples have been analyzed. The most
popular example is the biclustering problem, which we study in Section 4.2. However, apart
from this example and a few others, minimax bounds for the recovery problem are largely unknown.
Moreover, we are unaware of a broadly applicable analysis like the method we develop here.

A unified treatment is possible for estimation problems, where the atomic norm framework gives sharp
phase transitions on the mean squared error of the maximum likelihood estimator [8, 3, 20]. The atomic
norm is a generic approach for encoding structural assumptions by decomposing the signal into a sparse
convex combination of a set of base atoms (e.g., one-sparse vectors). While this line primarily focuses on
linear inverse problems, there are results for the estimation problem, although neither set of
results gives lower bounds on the minimax risk. Note that this approach is based on convex relaxation, and
it is not immediate that such a relaxation will succeed for the recovery problem, as the probability of error
for any dense family is one. Relatedly, the non-convexity of our risk poses new challenges that do not arise
with the mean squared error objective.

The recovery problem we consider here has also been extensively studied in the signal processing and
information theory literature, where it is referred to as Gaussian detection, or decoding with Additive White
Gaussian Noise (AWGN), although the motivation and results are quite different. As the goal in channel
coding is to reliably transmit as many bits of information as possible across a noisy channel, the vast majority
of channel coding results focus on codebook design. Researchers have studied structured (random and
non-random) codebooks solely for computational efficiency, as random codes achieve optimal transmission
rates but lead to a computationally intractable decoding problem. In contrast, in our setting the structured
codebook is inherent to the problem and the main object of interest; the analyst has no control over the
codebook and wants to achieve optimal decoding performance for the codebook specified.

Nevertheless, one line of work from this community analyzes the error probability of the maximum
likelihood estimator/decoder for a given codebook, and a survey of this work is available. Classical upper bounds include the
min-distance bound and the Gallager bound, although the bound we prove here is also well known.
Lower bounds come in two flavors: (a) sphere-packing lower bounds and (b) lower bounds on the
maximum-likelihood error probability. The former is a lower bound that is independent of the particular codebook, so
it does not give tight bounds on any specific family of vectors, while the latter applies only to the MLE, so it
does not relate to the minimax risk. In contrast, our technique simultaneously applies to any codebook and
any estimator, leading to precise lower bounds on the minimax risk. To our knowledge, apart from the upper
bound in Theorem 2, the results proved here do not appear in the information theory literature.

Turning briefly to the experimental design setting, a number of recent advances aim to quantify the
statistical improvements enabled by experimental or interactive design in specific normal means
instantiations. A unifying, interactive algorithm was recently proposed in the bandit optimization setting,
but it is not known to improve on non-interactive approaches for the setting we consider. This work makes
important progress, but a general-purpose interactive algorithm and a satisfactory characterization of the
advantages offered by interactive sampling remain elusive open questions. This paper makes progress on the
latter by developing lower bounds against all non-interactive approaches.


3 Main Results 

In this section we develop the main results of the paper. We start by bounding the minimax risk in the 
isotropic setting, then develop a certificate of optimality for the maximum likelihood estimator. Lastly, we 
turn to the experimental design setting. We provide proofs in Appendix A.

3.1 Bounds on the Isotropic Minimax Risk 

In the isotropic case, recall that we are given a finite collection $\mathcal{V}$ of vectors $\{v_j\}_{j=1}^M$ and an observation
$y \sim \mathcal{N}(v_j, I_d)$ for some $j \in [M]$. Given such an observation, a natural estimator is the maximum likelihood
estimator (MLE), which outputs the index $j$ under which the observation is most likely.
This estimator is defined as:

$$T_{\mathrm{MLE}}(y) = \operatorname{argmax}_{j \in [M]} \mathbb{P}_j(y) = \operatorname{argmin}_{j \in [M]} \|v_j - y\|_2^2. \tag{4}$$

We will analyze this estimator, which partitions $\mathbb{R}^d$ based on a Voronoi tessellation of the set $\mathcal{V}$.
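
As a minimal sketch of Equation 4 (not code from the paper), the MLE amounts to a nearest-neighbor search over the collection. The function name `mle` and the toy family below are illustrative assumptions.

```python
def mle(V, y):
    """Index j (0-based) of the vector in V closest to y in squared Euclidean distance."""
    def sq_dist(v):
        return sum((vi - yi) ** 2 for vi, yi in zip(v, y))
    # One pass over all M hypotheses with O(d) work each.
    return min(range(len(V)), key=lambda j: sq_dist(V[j]))


# Toy family (illustrative only): three scaled standard basis vectors in R^3.
V = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
y = [0.3, 1.6, -0.2]  # a noisy observation near the second vector
j_hat = mle(V, y)
```

Ties are broken in favor of the smallest index, which is immaterial under Gaussian noise because ties occur with probability zero.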

As stated, the running time of the estimator is $O(Md)$, but it is worth pausing to remark briefly on
computational considerations. In many examples of interest, the class $\mathcal{V}$ is combinatorial in nature, so $M$
may be exponentially large, and efficient implementations of the MLE may not exist. However, as our setup
does not preclude unstructured problems, the input to the estimator is the complete collection $\mathcal{V}$, so the
running time of the MLE is linear in the input size. If the particular problem is such that $\mathcal{V}$ can be compactly
represented (e.g., it has combinatorial structure), then the estimator may not be polynomial-time computable.
This presents a real issue, as researchers have shown that a minimax-optimal polynomial-time estimator is
unlikely to exist for the biclustering problem, which we study in Section 4.2. However, since the
primary interest of this work is statistical in nature, we will ignore computational considerations for most of
our discussion.

Our first result is a generic characterization of the minimax risk, which involves analysis of the MLE. 
The following function, which we call the Exponentiated Distance Function, plays a fundamental role. 

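Definition 1. For a family $\mathcal{V}$ and $\alpha > 0$, the Exponentiated Distance Function (EDF) is:

$$W(\mathcal{V}, \alpha) = \max_{j \in [M]} W_j(\mathcal{V}, \alpha), \tag{5}$$

$$W_j(\mathcal{V}, \alpha) = \sum_{k \ne j} \exp\left(-\frac{\|v_j - v_k\|_2^2}{\alpha}\right). \tag{6}$$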
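
The EDF can be evaluated directly when the family is small enough to enumerate. The following is a minimal sketch under that assumption; the names `edf_terms` and `edf` are illustrative and do not come from the paper.

```python
import math


def edf_terms(V, alpha):
    """W_j(V, alpha) for every j: sum over k != j of exp(-||v_j - v_k||_2^2 / alpha)."""
    terms = []
    for j, vj in enumerate(V):
        total = 0.0
        for k, vk in enumerate(V):
            if k != j:
                sq = sum((a - b) ** 2 for a, b in zip(vj, vk))
                total += math.exp(-sq / alpha)
        terms.append(total)
    return terms


def edf(V, alpha):
    """W(V, alpha): the largest W_j over hypotheses j."""
    return max(edf_terms(V, alpha))


# Toy family (illustrative): 2 * e_i in R^3, so every pair is at squared distance 8.
V = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
w8 = edf(V, 8.0)  # each hypothesis has two neighbors contributing exp(-8/8) each
```

The double loop costs $O(M^2 d)$, so this direct evaluation is only practical for small families; structured families usually admit closed-form counts instead.
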

In the following theorem, we show that the EDF governs the performance of $T_{\mathrm{MLE}}$. More importantly,
this function also leads to a lower bound on the minimax risk, and the combination of these two statements
shows that the MLE is nearly optimal for any structured normal means problem.

Theorem 2. Fix $\delta \in (0, 1)$. If $W(\mathcal{V}, 8) \le \delta$, then $\mathcal{R}(\mathcal{V}) \le \mathcal{R}(T_{\mathrm{MLE}}, \mathcal{V}) \le \delta$. On the other hand, if
$W(\mathcal{V}, 2(1 - \delta)) \ge 2^{\frac{1}{1-\delta}} - 1$, then $\mathcal{R}(\mathcal{V}) \ge \delta$.

By setting $\delta = 1/2$ above, the second statement in the theorem may be replaced by: If $W(\mathcal{V}, 1) \ge 3$,
then $\mathcal{R}(\mathcal{V}) \ge 1/2$. This setting often aids interpretability of the lower bound. Notice that the value of $\alpha$
disagrees between the lower and upper bounds, and this leads to a gap between the necessary and sufficient
conditions. This is not purely an artifact of our analysis, as there are many examples where the MLE does
not exactly achieve the minimax risk.
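
To illustrate how the two conditions of Theorem 2 might be checked numerically, the sketch below evaluates both for a small enumerable family. It assumes the lower-bound threshold reads $2^{1/(1-\delta)} - 1$, which agrees with the $\delta = 1/2$ specialization $W(\mathcal{V}, 1) \ge 3$ stated above. The helper names and the toy family are hypothetical.

```python
import math


def edf(V, alpha):
    """W(V, alpha) = max_j sum_{k != j} exp(-||v_j - v_k||_2^2 / alpha)."""
    def w_j(j):
        return sum(
            math.exp(-sum((a - b) ** 2 for a, b in zip(V[j], V[k])) / alpha)
            for k in range(len(V)) if k != j
        )
    return max(w_j(j) for j in range(len(V)))


def theorem2_check(V, delta):
    """Report which conclusion of Theorem 2, if any, is certified at level delta."""
    # Assumed reading of the lower-bound threshold: 2^(1/(1-delta)) - 1.
    threshold = 2.0 ** (1.0 / (1.0 - delta)) - 1.0
    return {
        "mle_risk_at_most_delta": edf(V, 8.0) <= delta,
        "minimax_risk_at_least_delta": edf(V, 2.0 * (1.0 - delta)) >= threshold,
    }


def scaled_basis(d, mu):
    """Hypothetical toy family: mu times each standard basis vector of R^d."""
    return [[mu if i == j else 0.0 for i in range(d)] for j in range(d)]


strong = theorem2_check(scaled_basis(20, 6.0), 0.5)  # well-separated hypotheses
weak = theorem2_check(scaled_basis(20, 0.5), 0.5)    # nearly indistinguishable hypotheses
```

For intermediate signal strengths neither condition holds, which is the gap between the necessary and sufficient conditions discussed above.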

However, most structured normal means problems of interest have an asymptotic flavor, specified by a
sequence of problems $\mathcal{V}_1, \mathcal{V}_2, \ldots$, and a signal-strength parameter $\mu$, with observation $y \sim \mathcal{N}(\mu v_j, I_d)$ for
some signal $v_j$ in the current family. In this asymptotic framework, we are interested in how $\mu$ scales with
the sequence to drive the minimax risk to one or zero. Almost all existing examples in the literature are of
this form [23], and in all such problems, Theorem 2 shows that the MLE achieves the minimax rate. To our
knowledge, such a comprehensive characterization of recovery problems is entirely new.

Note that the quantity $\mathcal{R}(T_{\mathrm{MLE}}, \mathcal{V})$ is simply the worst-case probability of error for the MLE, which
has been extensively studied in the information theory community. Classical upper bounds on this quantity
include the min-distance bound and Gallager's bound, but the EDF-based bound here has also appeared
in the literature. It is well known that the min-distance bound is often extremely loose (see Section 4),
while application of Gallager's bound often involves challenging calculations. The main novelty in our
result is the lower bound, which shows that the EDF also leads to a lower bound on the error probability
for all estimators/decoders, generically for all families $\mathcal{V}$. This new lower bound, together with the
existing upper bound, establishes that the EDF is the fundamental quantity in characterizing the minimax
risk in these recovery problems.

Application of Theorem 2 to instantiations of the structured normal means problem requires bounding
the EDF, which is significantly simpler than the typical derivation of this style of result. In particular, proving
a lower bound no longer requires construction of a specialized subfamily of $\mathcal{V}$, as was the de facto standard
in this line of work. In Section 4 we show how simple calculations can recover existing results. We
also show how the EDF-based bound often gives much sharper results than the min-distance bound.

Turning to the proof, the EDF arises naturally as an upper bound on the failure probability of the MLE 
after applying a union bound and a Gaussian tail bound. Obtaining a lower bound based on the EDF is 
more challenging, and our proof is based on application of Fano’s Inequality. We use a version of Fano’s 
Inequality that allows a non-uniform prior and explicitly construct this prior using the EDF. This leads to 
our more general lower bound. 

3.2 Minimax-Optimal Recovery 

Theorem 2 shows that the maximum likelihood estimator achieves rate-optimal performance for all
structured normal means recovery problems with the asymptotic flavor described. However, in some cases the
MLE does not achieve the exact minimax risk, and it is therefore not the optimal estimator. In this section,
we derive a sufficient condition for the exact minimax optimality of the MLE. As we will see via examples,
the MLE is minimax optimal for several well-studied instantiations of the structured normal means problem,
although analytically calculating the minimax risk is challenging.

The sufficient condition for optimality depends on a particular geometric structure of the family $\mathcal{V}$:

Definition 3. A family $\mathcal{V}$ is unitarily invariant if there exists a set of orthogonal matrices $\{R_i\}_{i=1}^M$ such that
for each vector $v \in \mathcal{V}$, the set $\{R_i v\}_{i=1}^M$ is exactly $\mathcal{V}$.

In other words, the instance $\mathcal{V}$ can be generated by applying the orthogonal transforms to any fixed vector
in the collection. Unitarily invariant problems exhibit high degrees of symmetry, and our next result shows
that this symmetry suffices to certify optimality of the MLE.

Theorem 4. If $\mathcal{V}$ is unitarily invariant, then the MLE is minimax optimal.
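
For small families, the invariance property can be checked by brute force. The sketch below uses coordinate permutations as the orthogonal transforms and tests, for every vector in the family, whether its set of images equals the family (multiplicity is ignored, which is an assumption of this sketch). The helper names are hypothetical.

```python
from itertools import combinations, permutations


def k_sets(d, k):
    """Indicator vectors (as tuples) of all size-k subsets of {0, ..., d-1}."""
    return [tuple(1 if i in S else 0 for i in range(d)) for S in combinations(range(d), k)]


def orbit(v, perms):
    """Images of v under coordinate permutations; each acts as an orthogonal matrix."""
    return {tuple(v[p[i]] for i in range(len(v))) for p in perms}


def is_invariant(V, perms):
    """True if, for every v in V, the set of images of v is exactly V."""
    target = set(V)
    return all(orbit(v, perms) == target for v in V)


perms = list(permutations(range(4)))
invariant_ksets = is_invariant(k_sets(4, 2), perms)
# A family mixing subset sizes cannot be generated from one vector by permutations.
invariant_mixed = is_invariant(k_sets(4, 1) + k_sets(4, 2), perms)
```

The check enumerates all $d!$ permutations, so it is meant only as a sanity check on toy instances; in practice the symmetry is argued analytically.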

Some remarks about the theorem are in order: 

1. This theorem reduces the question of optimality to a purely geometric characterization of $\mathcal{V}$ and, as we
will see, many important problems are unitarily invariant. One common family of orthogonal matrices
is the set of all permutation matrices on $\mathbb{R}^d$.

2. This result does not characterize the risk of the MLE; it only shows that no other estimator has better
risk. Specifically, it does not provide an analytic bound that is sharper than Theorem 2. From an
applied perspective, an optimality certificate for an estimator is more important than a bound on the
risk as it helps govern practical decisions, although risk bounds enable theoretical comparison.

3. Lastly, the result is not asymptotic in nature but rather shows that the MLE achieves the exact minimax
risk for a fixed family $\mathcal{V}$. We are not aware of any other results in the literature that certify optimality
of the MLE under our measure of risk.

The proof of this theorem is based on showing that the point-wise risk $\mathcal{R}_j(T, \mathcal{V})$ for the MLE is
constant across the hypotheses $j \in [M]$. This argument uses the unitary invariance property and an explicit
representation of the MLE as a collection of polyhedral acceptance regions, with one region per hypothesis.
Finally, we employ a dual characterization of the minimax risk to show that if the point-wise risk functional
is constant, then the MLE is minimax-optimal.

In problems where Theorem 4 can be applied, our results give a complete understanding of the isotropic
case. We know that the MLE exactly achieves the minimax risk, and Theorem 2 also gives upper and lower
bounds that match asymptotically.


3.3 The Experimental Design Setting 

We now turn to the experimental design setting, where the statistician specifies a strategy $B \in \mathbb{R}_+^d$ and
receives observation $y \sim \mathbb{P}_{j,B}$ given by Equation 2. Our main insight is that the choice of $B$ only changes
the metric structure of $\mathbb{R}^d$, and this change can be incorporated into the proof of Theorem 2. Specifically,
the likelihood for hypothesis $j$ under sampling strategy $B$ is:

$$\mathbb{P}_j(y \mid B) = \prod_{i=1}^d \sqrt{\frac{B(i)}{2\pi}} \exp\left\{-\frac{B(i)(v_j(i) - y(i))^2}{2}\right\},$$

and the maximum likelihood estimator is:

$$T_{\mathrm{MLE}}(y, B) = \operatorname{argmin}_{j \in [M]} \|v_j - y\|_B^2.$$
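
A minimal sketch of this observation model and the corresponding weighted estimator follows. It assumes the reading of Equation 2 in which coordinate $i$ has variance $1/B(i)$, with $y(i) = 0$ when $B(i) = 0$; the function names are illustrative.

```python
import random


def sample_observation(v, B, rng):
    """Draw y with y(i) ~ N(v(i), 1/B(i)); y(i) = 0 when B(i) = 0."""
    return [rng.gauss(vi, bi ** -0.5) if bi > 0 else 0.0 for vi, bi in zip(v, B)]


def weighted_mle(V, y, B):
    """Index j minimizing the weighted distance sum_i B(i) * (v_j(i) - y(i))^2."""
    def cost(v):
        return sum(bi * (vi - yi) ** 2 for vi, yi, bi in zip(v, y, B))
    return min(range(len(V)), key=lambda j: cost(V[j]))


# Toy illustration: the same observation decodes differently under different weights.
V = [[1.0, 0.0], [0.0, 1.0]]
y = [0.6, 0.9]
uniform_choice = weighted_mle(V, y, [1.0, 1.0])  # both coordinates trusted equally
skewed_choice = weighted_mle(V, y, [1.9, 0.1])   # first coordinate measured more precisely
```

Coordinates with larger $B(i)$ are measured more precisely and therefore count more in the decision, which is the sense in which $B$ changes the metric.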

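We port Theorem 2 to this setting and show the following:

Theorem 5. Fix $\delta \in (0, 1)$ and any sampling strategy $B$ with $\|B\|_1 \le \tau$. Define the Sampling
Exponentiated Distance Function (SEDF):

$$W(\mathcal{V}, \alpha, B) = \max_{j \in [M]} \sum_{k \ne j} \exp\left(-\frac{\|v_k - v_j\|_B^2}{\alpha}\right). \tag{7}$$

If $W(\mathcal{V}, 8, B) \le \delta$, then $\mathcal{R}(\mathcal{V}, \tau) \le \sup_{j \in [M]} \mathbb{P}_{j,B}[T_{\mathrm{MLE}}(y, B) \ne j] \le \delta$. Conversely, if $W(\mathcal{V}, 2(1 - \delta), B) \ge 2^{\frac{1}{1-\delta}} - 1$,
then $\inf_T \sup_{j \in [M]} \mathbb{P}_{j,B}[T(y) \ne j] \ge \delta$.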

The structure of the theorem is almost identical to that of Theorem 2, but it is worth making some
important observations. First, the theorem holds for any non-interactive sampling strategy $B \in \mathbb{R}_+^d$, so the
upper bound is strictly more general than Theorem 2. Therefore, this theorem also gives bounds for the
non-isotropic or heteroskedastic case with known, shared covariance. Second, any strategy can be used to derive
an upper bound on the minimax risk, but the same is not true for the lower bound. Instead the lower bound
depends on the strategy, so one must minimize over strategies to lower bound $\mathcal{R}(\mathcal{V}, \tau)$. Fortunately, since
the SEDF is convex in $B$ and the budget constraint is polyhedral, the best strategy can be computed
by convex programming. This gives a new algorithm for designing sampling procedures.

Specifically, for any $\alpha$, to compute an associated design strategy, we solve the convex program

$$\operatorname{minimize}_{B \in \mathbb{R}_+^d,\ \|B\|_1 \le \tau} \ \max_{j \in [M]} \sum_{k \ne j} \exp\left(-\frac{\|v_k - v_j\|_B^2}{\alpha}\right) \tag{8}$$

to obtain the sampling strategy $B$ that minimizes the SEDF. For example, solving Program 8 with $\alpha = 1$
results in a strategy $B$, and if $W(\mathcal{V}, 1, B) \ge 3$, then we know that the minimax risk $\mathcal{R}(\mathcal{V}, \tau)$ over all
strategies is at least 1/2. On the other hand, solving with $\alpha = 8$ to obtain a (different) sampling strategy
$B$ and then using $B$ with the MLE would give the tightest upper bound on the risk attainable by our proof
technique. In Section 4.4 we demonstrate an example where this optimization leads to a non-uniform sampling
strategy that outperforms uniform sampling.
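
One way to approximate a solution of this program without a convex solver is a first-order method. The sketch below is an exponentiated-gradient heuristic, not the paper's procedure: it spends the full budget (reasonable because the SEDF is non-increasing in each $B(i)$), follows a subgradient of the maximum, and returns the best iterate seen. All names, the step size, and the toy graph are assumptions.

```python
import math


def sedf_terms(V, alpha, B):
    """W_j(V, alpha, B) = sum_{k != j} exp(-||v_k - v_j||_B^2 / alpha) for each j."""
    out = []
    for j, vj in enumerate(V):
        total = 0.0
        for k, vk in enumerate(V):
            if k != j:
                sq = sum(b * (a - c) ** 2 for a, c, b in zip(vj, vk, B))
                total += math.exp(-sq / alpha)
        out.append(total)
    return out


def sedf(V, alpha, B):
    return max(sedf_terms(V, alpha, B))


def design(V, alpha, tau, steps=300, lr=0.5):
    """Exponentiated-gradient heuristic for the SEDF program; returns the best iterate."""
    d = len(V[0])
    B = [tau / d] * d  # start from the uniform allocation
    best_B, best_val = B[:], sedf(V, alpha, B)
    for _ in range(steps):
        terms = sedf_terms(V, alpha, B)
        j = max(range(len(V)), key=terms.__getitem__)  # hypothesis attaining the max
        grad = [0.0] * d  # subgradient of the max: derivative of W_j in each B(i)
        for k, vk in enumerate(V):
            if k != j:
                diff2 = [(a - c) ** 2 for a, c in zip(V[j], vk)]
                w = math.exp(-sum(b * q for b, q in zip(B, diff2)) / alpha)
                for i in range(d):
                    grad[i] -= w * diff2[i] / alpha
        scale = max(abs(g) for g in grad) or 1.0
        B = [b * math.exp(-lr * g / scale) for b, g in zip(B, grad)]
        total = sum(B)
        B = [tau * b / total for b in B]  # rescale so that ||B||_1 = tau
        val = sedf(V, alpha, B)
        if val < best_val:
            best_B, best_val = B[:], val
    return best_B, best_val


# Toy stars family (illustrative): edges (0,1), (0,2), (0,3), (1,2) on four vertices.
mu = 1.5
stars = [[1, 1, 1, 0], [1, 0, 0, 1], [0, 1, 0, 1], [0, 0, 1, 0]]
V = [[mu * x for x in row] for row in stars]
uniform_val = sedf(V, 8.0, [1.0] * 4)
B_opt, opt_val = design(V, 8.0, tau=4.0)
```

Because the uniform allocation is the starting point and the best iterate is tracked, the returned value is never worse than uniform sampling; it is not a certificate of optimality.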

In general it is challenging to analytically certify that an allocation strategy $B$ minimizes the SEDF, but
in some cases it is possible. Since the SEDF is convex in $B$, specializing the first-order optimality conditions
for the resulting convex program gives the following:

Proposition 6. Let $B$ be a sampling strategy with $\|B\|_1 = \tau$, let $S(B) \subset \mathcal{V}$ be the set of hypotheses achieving
the maximum in $W(\mathcal{V}, \alpha, B)$, and let $\pi$ be a distribution on $S(B)$. If the quantity

$$\sum_{j \in S(B)} \pi_j \sum_{k \ne j} (v_k(i) - v_j(i))^2 \exp\left(-\frac{\|v_k - v_j\|_B^2}{\alpha}\right)$$

is constant across $i \in [d]$, then $B$ is a minimizer of $W(\mathcal{V}, \alpha, B)$ subject to $\|B\|_1 \le \tau$.

In many cases, this result leads to analytic lower bounds. Specifically, the result is especially useful when
$B$ is uniform across the coordinates, and $S(B) = [M]$, so that all of the hypotheses achieve the maximum.
In this case, it often suffices to choose $\pi$ to be uniform over the hypotheses and exploit the high degree of
symmetry to demonstrate the condition holds. As we will see in Section 4, many examples studied in the
literature exhibit the requisite symmetry for this proposition to be applied in a straightforward manner.
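
The symmetric case can be checked numerically on a toy instance. The sketch below assumes the condition is read as the $\pi$-weighted derivative of the SEDF terms with respect to each $B(i)$, that is, $\sum_j \pi_j \sum_{k \ne j} (v_k(i) - v_j(i))^2 \exp(-\|v_k - v_j\|_B^2 / \alpha)$, and verifies that it is constant across coordinates for a small $k$-sets family under uniform $B$ and uniform $\pi$. The function name is hypothetical.

```python
import math
from itertools import combinations


def coordinate_profile(V, alpha, B, pi):
    """q(i) = sum_j pi_j sum_{k != j} (v_k(i) - v_j(i))^2 exp(-||v_k - v_j||_B^2 / alpha)."""
    d = len(V[0])
    q = [0.0] * d
    for j, vj in enumerate(V):
        for k, vk in enumerate(V):
            if k != j:
                diff2 = [(a - c) ** 2 for a, c in zip(vk, vj)]
                w = math.exp(-sum(b * t for b, t in zip(B, diff2)) / alpha)
                for i in range(d):
                    q[i] += pi[j] * diff2[i] * w
    return q


# k-sets with d = 5, k = 2 and signal strength mu = 1.5 (illustrative values).
d, k, mu = 5, 2, 1.5
V = [[mu if i in S else 0.0 for i in range(d)] for S in combinations(range(d), k)]
B = [1.0] * d                 # uniform allocation with budget tau = d
pi = [1.0 / len(V)] * len(V)  # uniform distribution over hypotheses
q = coordinate_profile(V, 8.0, B, pi)
spread = max(q) - min(q)      # zero up to floating-point rounding
```

Every coordinate plays the same role in the $k$-sets family, so the profile is flat, which is the kind of symmetry argument described above.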

Note that Tanczos and Castro [23] establish a similar sufficient condition for the uniform sampling
strategy to be optimal. Their result however is slightly less general in that it only certifies optimality for the
uniform sampling strategy, whereas ours, in principle, can be applied more universally. In addition, their
result applies only to problems where the hypotheses are of the form $\mu 1_S$ for a collection of subsets $S$, while
ours is more general. This generality is important for some examples in Section 4. The other main difference
is that their approach is not based on the SEDF, so their result is not directly applicable here.


4 Examples 

This section contains four instantiations of structured normal means problems, and concrete results easily 
attainable from our approach. These examples have the asymptotic flavor described before, where we are 
interested in how a signal strength parameter $\mu$ scales with a sequence of instances.

The first example, the k-sets problem, is well studied, and as a warmup, we show how our technique 
recovers existing results. The second and third examples are motivated by biclustering and hierarchical 
clustering applications; in both problems our techniques establish a lower bound against all non-interactive 
approaches, demonstrating separation between interactive and non-interactive procedures. In the hierarchical 
clustering case, this separation is new. Finally the fourth example is a graph-structured signal processing 
problem where we empirically show that uniform sampling can be sub-optimal. The requisite calculations 
for these examples are deferred to Appendix A.

4.1 k-sets 

In the $k$-sets problem, we have $M = \binom{d}{k}$ and each vector $v_j = 1_{S_j}$ where $S_j \subset [d]$ and $|S_j| = k$. The
observation is $y \sim \mathcal{N}(\mu v_j, I_d)$ for some hypothesis $j$.

Corollary 7. The minimax rate for $k$-sets is $\mu \asymp \sqrt{\log(k(d - k))}$ and $\mu \asymp \sqrt{\frac{d}{\tau} \log(k(d - k))}$, with budget
$\tau$. In the isotropic case, the MLE is minimax optimal.

This corollary follows simply by bounding the EDF for the $k$-sets problem using binomial
approximations. Using Proposition 6, it is easy to verify that uniform sampling is optimal here, which immediately
gives the second claim. Finally, using the set of all permutation matrices and exploiting symmetry, we can
easily verify that this class is unitarily invariant. These bounds agree with established results in the
literature. To contrast, the min-distance bound from classical coding theory would reveal that the MLE
succeeds when $\mu = \omega(\sqrt{k \log(d/k)})$. This bound is polynomially worse than the one attained by our
more-refined EDF-based bound.
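
To make the EDF calculation for $k$-sets concrete, the sketch below groups the neighbors of a fixed subset $S$ by $s = |S \setminus S'|$. There are $\binom{k}{s}\binom{d-k}{s}$ such neighbors, each at squared distance $2 s \mu^2$, and by symmetry every $W_j$ takes the same value. The function names are illustrative, and the brute-force version is included only to cross-check the counting.

```python
import math
from itertools import combinations


def ksets_edf(d, k, mu, alpha):
    """EDF of the k-sets family via the count of neighbors at each overlap level s."""
    return sum(
        math.comb(k, s) * math.comb(d - k, s) * math.exp(-2.0 * s * mu ** 2 / alpha)
        for s in range(1, min(k, d - k) + 1)
    )


def ksets_edf_bruteforce(d, k, mu, alpha):
    """Same quantity by enumerating every other subset against S = {0, ..., k-1}."""
    base = set(range(k))
    total = 0.0
    for S in combinations(range(d), k):
        diff = len(base.symmetric_difference(S))  # equals 2s
        if diff > 0:
            total += math.exp(-(mu ** 2) * diff / alpha)
    return total


closed = ksets_edf(6, 2, 1.3, 8.0)
brute = ksets_edf_bruteforce(6, 2, 1.3, 8.0)
```

The $s = 1$ term is $k(d - k)\exp(-2\mu^2/\alpha)$, which is of constant order exactly when $\mu^2$ is proportional to $\log(k(d - k))$, matching the rate in Corollary 7.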

4.2 Biclusters 

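The biclustering problem operates over $\mathbb{R}^{d \times d}$ with $M = \binom{d}{k}^2$. We parametrize the class $\mathcal{V}$ with two indices
so that $v_{ij} = 1_{S_i} 1_{S_j}^T \in \{0, 1\}^{d \times d}$ with $|S_i| = |S_j| = k$. The observation is $y \sim \mathcal{N}(\mu\, \mathrm{vec}(v_{ij}), I_{d^2})$ for a
hypothesis $(i, j)$.

Corollary 8. The minimax rate for biclusters is $\mu \asymp \sqrt{\frac{1}{k} \log(k(d - k))}$ and $\mu \asymp \sqrt{\frac{d^2}{k \tau} \log(k(d - k))}$, with
budget constraint $\tau$. In the isotropic case, the MLE is minimax optimal.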

Our bounds agree with existing analyses of this class. The biclustering problem is interesting
because there is a simple interactive algorithm that succeeds at a signal strength that is a
factor of $\sqrt{k}$ smaller than the lower bound established here, demonstrating concrete statistical gains from
interactivity [23]. We provide an analysis of this interactive algorithm in Appendix A. Note also that our
EDF-based bound is polynomially better than the classical min-distance bound.

4.3 Hierarchical Clustering 

We study a model for similarity-based hierarchical clustering from Balakrishnan et al. This model is
known as the Constant Block Model (CBM) and the balanced version is parameterized by a number of
objects $n$, a minimum cluster size $m$, both of which are powers of 2, and a separation parameter $\mu$. The
hierarchical clustering is a perfectly balanced binary hierarchy on $n$ objects with minimum cluster size $m$,
where within-cluster similarities are exactly $\mu$ larger than the between-cluster similarities at each level of the
hierarchy. The induced $n \times n$ similarity matrix is related to an ultrametric (we provide a formal definition
in Appendix A).

Balakrishnan et al. analyze a recursive spectral clustering algorithm on this model in the presence of
Gaussian noise. By associating the similarity matrix of each possible hierarchical clustering with an element
of $\mathcal{V}$, their setting is a special case of the structured normal means problem, and our results can be used to
characterize the minimax rate.

Corollary 9. In the CBM, if $\mu = o(\ldots)$ and, under budget constraint $\tau$, if $\mu = o(\ldots)$,
then the minimax risk remains bounded away from 0. In the isotropic case, the MLE is minimax optimal.

This corollary is a strict generalization of the minimax analysis of Balakrishnan et al., who only
consider the case $m = n/2$. Moreover, the results of Tanczos and Castro [23] do not apply here as the
family does not correspond to indicator vectors of some set system. Thus, our more general treatment
enables analysis of this important setting.

Note however that we only prove a lower bound on the minimax risk here. This lower bound, combined
with existing analysis, establishes exponential separation between interactive and non-interactive approaches
for this hierarchical clustering setting. The interactive algorithm of Krishnamurthy et al. can recover
clusters of size $m = \Omega(\log n)$ with $\tau = O(n\, \mathrm{polylog}(n))$ and $\mu = O(1)$. On the other hand, Corollary 9
shows that, with these settings of $\mu$ and $\tau$, no non-interactive algorithm can recover clusters of size $m =
o(n)$, which is exponentially worse than the interactive algorithm.


4.4 Stars 


Let $G = (V, E)$ be a graph and let the edges be numbered $1, \ldots, d$. The class $\mathcal{V}$ is the set of all stars in the
graph, that is, the vector $v_j \in \{0, 1\}^d$ is the indicator vector of all edges emanating from the $j$th node in the
graph. Again the observation is $y \sim \mathcal{N}(\mu v_j, I_d)$ for some $j \in [|V|]$.


Corollary 10. In the stars problem, if the ratio between the maximum and minimum degree is bounded by a
constant, i.e. $\deg_{\max} / \deg_{\min} \le c$, then the minimax rate is $\mu \asymp \sqrt{\frac{1}{\deg_{\min}} \log(|V| - \deg_{\min})}$.


Again this agrees with a recent result of Tanczos and Castro [23], who consider $s$-stars of the complete
graph, formed by choosing a vertex, and then activating $s$ of the edges emanating out of that vertex. The two
bounds agree in the special case of the complete graph with $s = |V| - 1$, but otherwise are incomparable,
as they consider different problem structures. Note that the degree requirement here is not fundamental in
Theorem 2 but arises from problem-specific approximations.

Figure 1: Left: A realization of the stars problem for a graph with 13 vertices and 34 edges with sampling
budget $\tau = 34$. Edge color reflects allocation of sensing energy and vertex color reflects success probability
for the MLE under that hypothesis (warmer colors are higher for both). Isotropic (left) has minimum success
probability of 0.44 and experimental design (center) has minimum success probability 0.56. Right:
Maximum risk for isotropic and experimental design sampling as a function of $\mu$ for the stars problem on a 50- and
100-vertex graph.


We highlight this example because the uniform allocation strategy does not necessarily minimize $W(\mathcal{V}, \alpha, B)$.
In Figure 1 we construct a graph according to the Barabasi-Albert model and consider the class of stars
on this graph. The simulation results show that optimizing the SEDF to find a sampling strategy is never
worse than uniform sampling, and for low signal strengths it can lead to significantly lower maximum risk.
Note that the risk for both uniform and non-uniform sampling approaches zero as $\mu \to \infty$, so for large $\mu$,
there is little advantage to optimizing the sampling scheme.


5 Discussion 

This paper studies the structured normal means problem and gives a unified characterization of the minimax 
risk for isotropic and experimental design settings. Our work provides insights into how to choose estimators 
and how to design sampling strategies for these problems. Our lower bounds are critical in separating
non-interactive and interactive sampling, which is an important research direction.

There are a number of exciting directions for future work, including extensions to other structure discov¬ 
ery problems, and to other observation models, such as compressive observations. We are most interested 
in developing a unifying theory for interactive sampling, analogous to the theory developed here. Another 
important and challenging direction is to consider recovery problems that involve nuisance parameters. We 
look forward to studying these problems in future work. 
A Proofs

A.1 Proof of Theorem 1^

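Analysis of MLE: We first analyze the maximum likelihood estimator:

\[ T_{MLE}(y) = \operatorname{argmin}_{j \in [M]} \|v_j - y\|_2^2 \]

This estimator succeeds as long as $\|v_k - y\|_2^2 > \|v_{j^*} - y\|_2^2$ for each $k \ne j^*$, when $y \sim P_{j^*}$. This condition
is equivalent to:

\[ \|v_k - y\|_2^2 > \|v_{j^*} - y\|_2^2 \iff \langle \epsilon, v_k - v_{j^*} \rangle < \tfrac{1}{2} \|v_{j^*} - v_k\|_2^2 \]

where $\epsilon \sim \mathcal{N}(0, I_d)$. This follows from writing $y = v_{j^*} + \epsilon$ and then expanding the squares. So we must
simultaneously control all of these events, for fixed $j^*$:

\[ P_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\forall k \ne j^*.\ \langle \epsilon, v_k - v_{j^*} \rangle < \|v_{j^*} - v_k\|_2^2/2\right] \]
\[ = 1 - P_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\exists k \ne j^*.\ \langle \epsilon, v_k - v_{j^*} \rangle \ge \|v_{j^*} - v_k\|_2^2/2\right] \]
\[ \ge 1 - \sum_{k \ne j^*} P_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\langle \epsilon, v_k - v_{j^*} \rangle \ge \|v_{j^*} - v_k\|_2^2/2\right] \]

By a Gaussian tail bound, each probability in the sum satisfies:

\[ P_{\epsilon \sim \mathcal{N}(0, I_d)}\left[\langle \epsilon, v_k - v_{j^*} \rangle \ge \|v_{j^*} - v_k\|_2^2/2\right] \le \exp\left\{-\tfrac{1}{8} \|v_{j^*} - v_k\|_2^2\right\} \]

So the total failure probability is upper bounded by:

\[ P_{j^*}[T_{MLE}(y) \ne j^*] \le \sum_{k \ne j^*} \exp\left\{-\tfrac{1}{8} \|v_{j^*} - v_k\|_2^2\right\} = W_{j^*}(\mathcal{V}, 8) \]

So if $j$ is the truth, then the probability of error is smaller than $\delta$ when $W_j(\mathcal{V}, 8) \le \delta$. For the maximal (over
hypothesis choice $j$) probability of error to be smaller than $\delta$, it suffices to have $W(\mathcal{V}, 8) \le \delta$.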
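
As a rough illustration of the quantities in this argument, the following sketch computes $W_j(\mathcal{V}, \alpha)$ and $W(\mathcal{V}, \alpha)$ by direct summation and implements the MLE as a nearest-mean rule. The function names and the toy family of 1-sparse vectors are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sedf_terms(V, alpha):
    """W_j(V, alpha) = sum_{k != j} exp(-||v_j - v_k||^2 / alpha), one value per j."""
    return [
        sum(math.exp(-sq_dist(vj, vk) / alpha) for k, vk in enumerate(V) if k != j)
        for j, vj in enumerate(V)
    ]

def sedf(V, alpha):
    """W(V, alpha) = max_j W_j(V, alpha)."""
    return max(sedf_terms(V, alpha))

def mle(V, y):
    """Index of the hypothesis whose mean is closest to the observation y."""
    return min(range(len(V)), key=lambda j: sq_dist(V[j], y))

# Hypothetical toy family: 1-sparse vectors in dimension d with signal strength mu.
mu, d = 6.0, 4
V = [[mu if i == j else 0.0 for i in range(d)] for j in range(d)]

# W(V, 8) is the union bound on the MLE error probability derived above.
mle_error_bound = sedf(V, 8.0)
```

For this toy family every pair of hypotheses is at squared distance $2\mu^2$, so the bound evaluates to $(d-1)\exp(-\mu^2/4)$.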

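Fundamental Limit: We now turn to the fundamental limit. We start with a version of Fano's inequality
with non-uniform prior.

Lemma 11 (Non-uniform Fano Inequality). Let $\Theta = \{\theta\}$ be a parameter space that indexes a family of
probability distributions $P_\theta$ over a space $\mathcal{X}$. Fix a prior distribution $\pi$, supported on $\Theta$, and consider $\theta \sim \pi$
and $X \sim P_\theta$. Let $f : \mathcal{X} \to \Theta$ be any possibly randomized mapping, and let $p_e = P_{\theta \sim \pi, X \sim P_\theta}[f(X) \ne \theta]$
denote the probability of error. Then:

\[ p_e \ge 1 - \frac{\sum_{\theta \in \Theta} \pi(\theta) KL(P_\theta \| P_\pi) + \log 2}{H(\pi)} \]

where $P_\pi(\cdot) = \sum_{\theta} \pi(\theta) P_\theta(\cdot)$ is the mixture distribution. In particular, we have:

\[ \inf_f \sup_\theta P_{X \sim P_\theta}[f(X) \ne \theta] \ge \inf_f E_{\theta \sim \pi} P_{X \sim P_\theta}[f(X) \ne \theta] \ge 1 - \frac{\sum_{\theta} \pi(\theta) KL(P_\theta \| P_\pi) + \log 2}{H(\pi)} \]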
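Proof. Consider the Markov chain $\theta \to X \to \hat{\theta} = f(X)$ where $\theta \sim \pi$ and $X | \theta \sim P_\theta$. Let $E = \mathbf{1}[\hat{\theta} \ne \theta]$. Then:

\[ H(E|X) + H(\theta|E, X) = H(E, \theta|X) = H(\theta|X) + H(E|\theta, X) \ge H(\theta|X) \]

where we used $H(E|\theta, X) \ge 0$. Since conditioning only reduces entropy, we have the inequality

\[ H(\theta|X) \le H(p_e) + H(\theta|E, X) = H(p_e) + H(\theta|E = 0, X) P[E = 0] + H(\theta|E = 1, X) P[E = 1] \]
\[ \le H(p_e) + p_e H(\theta) \]

which proves the usual version of Fano's inequality. We want to write $H(\theta|X)$ in terms of the KL divergence,
using the mixture distribution $P_\pi$:

\[ H(\theta|X) = H(\theta, X) - H(X) = -\int \sum_\theta \pi(\theta) P_\theta(x) \log \frac{\pi(\theta) P_\theta(x)}{P_\pi(x)} dx \]
\[ = -\sum_\theta \pi(\theta) \int P_\theta(x) \log \frac{P_\theta(x)}{P_\pi(x)} dx - \sum_\theta \pi(\theta) \log \pi(\theta) \]
\[ = -\sum_\theta \pi(\theta) KL(P_\theta \| P_\pi) + H(\pi) \]

Combining these, and noting that $H(\theta) = H(\pi)$, gives the bound:

\[ H(p_e) + p_e H(\pi) \ge H(\pi) - \sum_\theta \pi(\theta) KL(P_\theta \| P_\pi). \]

By upper bounding $H(p_e) \le \log 2$ and rearranging we prove the claim. □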
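
To make the bound of Lemma 11 concrete, here is a small sketch that evaluates its right-hand side for distributions on a finite sample space. Restricting to discrete distributions is an assumption made so that the KL divergences to the mixture can be computed exactly; the function names are hypothetical.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions on a common finite support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(prior):
    """Shannon entropy H(pi) in nats."""
    return -sum(w * math.log(w) for w in prior if w > 0)

def fano_lower_bound(prior, dists):
    """Evaluate 1 - (sum_theta pi(theta) KL(P_theta || P_pi) + log 2) / H(pi)."""
    n = len(dists[0])
    # Mixture distribution P_pi = sum_theta pi(theta) P_theta.
    mixture = [sum(w * p[x] for w, p in zip(prior, dists)) for x in range(n)]
    avg_kl = sum(w * kl(p, mixture) for w, p in zip(prior, dists) if w > 0)
    return 1.0 - (avg_kl + math.log(2)) / entropy(prior)

# Four indistinguishable hypotheses under a uniform prior: the averaged KL term is zero.
bound = fano_lower_bound([0.25] * 4, [[0.5, 0.5]] * 4)
```

When the hypotheses are indistinguishable the bound reduces to $1 - \log 2 / H(\pi)$, and it decreases as the distributions separate.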


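For a distribution $\pi \in \Delta_{M-1}$ over the hypotheses, let $P_\pi(\cdot) = \sum_k \pi_k P_k(\cdot)$ be the mixture distribution.
Then Fano's inequality (Lemma 11) states that the minimax probability of error is lower bounded by:

\[ \inf_T \sup_{j \in [M]} P_j[T(y) \ne j] \ge \inf_T E_{j \sim \pi} P_j[T(y) \ne j] \ge 1 - \frac{\sum_k \pi_k KL(P_k \| P_\pi) + \log 2}{H(\pi)} \]

Fix $\delta \in (0, 1)$ and let $j^* = \operatorname{argmax}_{j \in [M]} W_j(\mathcal{V}, 2(1 - \delta))$. We will use a prior based on this quantity:

\[ \pi_k \propto \exp\left(-\frac{\|v_{j^*} - v_k\|_2^2}{2(1 - \delta)}\right) \]

With this prior, the entropy becomes:

\[ H(\pi) = \sum_k \pi_k \log \frac{\sum_i \exp\left(-\frac{\|v_{j^*} - v_i\|_2^2}{2(1 - \delta)}\right)}{\exp\left(-\frac{\|v_{j^*} - v_k\|_2^2}{2(1 - \delta)}\right)} \]
\[ = \log(W(\mathcal{V}, 2(1 - \delta)) + 1) + \sum_k \pi_k \frac{\|v_{j^*} - v_k\|_2^2}{2(1 - \delta)} \]
\[ = \log(W(\mathcal{V}, 2(1 - \delta)) + 1) + \frac{1}{1 - \delta} \sum_k \pi_k KL(P_k \| P_{j^*}) \]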
The 1 inside the first log comes from the fact that in the definition of $W_{j^*}$ we do not include the term involving
$j^*$ in the sum, while our prior $\pi$ does place mass proportional to 1 on hypothesis $j^*$. The term involving the
KL-divergence follows from the fact that the KL divergence between two Gaussians with identity covariance is one-half
the squared $\ell_2$-distance between their means.

Looking at the lower bound from Fano's inequality, we see that if:

\[ \sum_k \pi_k KL(P_k \| P_\pi) + \log 2 \le (1 - \delta) H(\pi) = (1 - \delta) \log(W(\mathcal{V}, 2(1 - \delta)) + 1) + \sum_k \pi_k KL(P_k \| P_{j^*}) \]

then the probability of error is lower bounded by $\delta$. Of course it is immediate that:

\[ \sum_k \pi_k KL(P_k \| P_{j^*}) = \sum_k \pi_k \int P_k(x) \log \frac{P_k(x)}{P_\pi(x)} dx + \sum_k \pi_k \int P_k(x) \log \frac{P_\pi(x)}{P_{j^*}(x)} dx \]
\[ = \sum_k \pi_k KL(P_k \| P_\pi) + KL(P_\pi \| P_{j^*}) \ge \sum_k \pi_k KL(P_k \| P_\pi) \]

So the condition reduces to requiring that:

\[ \log 2 \le (1 - \delta) \log(W(\mathcal{V}, 2(1 - \delta)) + 1). \]

After some algebra, this is equivalent to:

\[ W(\mathcal{V}, 2(1 - \delta)) \ge 2^{\frac{1}{1 - \delta}} - 1 \] □
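
The lower-bound argument can be checked numerically on small families. The sketch below builds the prior $\pi_k \propto \exp(-\|v_{j^*} - v_k\|_2^2 / (2(1-\delta)))$ and tests the final condition $W(\mathcal{V}, 2(1-\delta)) \ge 2^{1/(1-\delta)} - 1$. The helper names and the 1-sparse toy family are illustrative assumptions, not part of the paper.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def fano_prior(V, delta):
    """Return (j_star, prior, W) for the prior used in the lower-bound proof."""
    alpha = 2.0 * (1.0 - delta)
    M = len(V)
    terms = [
        sum(math.exp(-sq_dist(V[j], V[k]) / alpha) for k in range(M) if k != j)
        for j in range(M)
    ]
    j_star = max(range(M), key=lambda j: terms[j])
    weights = [math.exp(-sq_dist(V[j_star], vk) / alpha) for vk in V]
    total = sum(weights)  # equals W(V, 2(1 - delta)) + 1
    return j_star, [w / total for w in weights], terms[j_star]

def lower_bound_applies(V, delta):
    """True when W(V, 2(1 - delta)) >= 2^(1/(1 - delta)) - 1."""
    _, _, W = fano_prior(V, delta)
    return W >= 2.0 ** (1.0 / (1.0 - delta)) - 1.0

def sparse_family(d, mu):
    """Hypothetical toy family: 1-sparse vectors with signal strength mu."""
    return [[mu if i == j else 0.0 for i in range(d)] for j in range(d)]
```

With $\delta = 1/2$ the condition reads $W(\mathcal{V}, 1) \ge 3$, so a weak signal such as `sparse_family(8, 0.1)` satisfies it while a strong signal such as `sparse_family(8, 3.0)` does not.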


A.2 Proof of Theorem |5] 

The proof is essentially the same as the proof in Appendix A.1, coupled with two observations.
First, for a sampling strategy $B \in \mathbb{R}_+^d$, the maximum likelihood estimator is:

\[ T_{MLE}(y, B) = \operatorname{argmin}_{j \in [M]} \|v_j - y\|_B^2 \]

so the analysis of the MLE depends on the Mahalanobis norm $\|\cdot\|_B$ instead of the $\ell_2$ norm.

Similarly, the KL divergence between the distributions $P_{j,B}$ and $P_{k,B}$ depends on the Mahalanobis norm
$\|\cdot\|_B$ instead of the $\ell_2$ norm. Specifically, we have:

\[ KL(P_{j,B} \| P_{k,B}) = \tfrac{1}{2} \|v_j - v_k\|_B^2. \]

The lower bound proof uses this metric structure instead, but the calculations are equivalent. □
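
The two observations above can be sketched in code. This example assumes that $\|x\|_B^2 = \sum_i B(i) x(i)^2$, which corresponds to coordinate $i$ being observed with noise variance $1/B(i)$; that convention and all function names are assumptions for illustration.

```python
import math

def weighted_sq_dist(u, v, B):
    """||u - v||_B^2 = sum_i B(i) (u(i) - v(i))^2 (assumed form of the norm)."""
    return sum(b * (a - c) ** 2 for a, c, b in zip(u, v, B))

def weighted_mle(V, y, B):
    """MLE under allocation B: nearest hypothesis mean in the B-weighted norm."""
    return min(range(len(V)), key=lambda j: weighted_sq_dist(V[j], y, B))

def sedf_design(V, alpha, B):
    """W(V, alpha, B) = max_j sum_{k != j} exp(-||v_j - v_k||_B^2 / alpha)."""
    return max(
        sum(math.exp(-weighted_sq_dist(vj, vk, B) / alpha)
            for k, vk in enumerate(V) if k != j)
        for j, vj in enumerate(V)
    )

# Two hypotheses in the plane; the allocation changes which one is closest to y.
V = [[2.0, 0.0], [0.0, 2.0]]
y = [1.2, 1.5]
isotropic_choice = weighted_mle(V, y, [1.0, 1.0])
designed_choice = weighted_mle(V, y, [10.0, 0.1])
```

Under a uniform allocation $B(i) = c$, the weighted quantities reduce to the isotropic ones with $\alpha$ replaced by $\alpha / c$.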


A.3 Proof of Proposition 

To simplify the presentation, let $f(B) = W(\mathcal{V}, \alpha, B)$. $f(B)$ is convex and (strictly) monotonically decreasing,
so we know that the minimum will be achieved when the constraint is tight, i.e. when $\|B\|_1 = \tau$. The
Lagrangian is:

\[ \mathcal{L}(B, \lambda) = f(B) + \lambda(\|B\|_1 - \tau) \]

and the minimum is achieved at $B$, with $\|B\|_1 = \tau$, if there is a value $\lambda$ such that $0 \in \partial \mathcal{L}(B, \lambda)$. Observing
that the subgradient is $\partial f(B) + \lambda \mathbf{1}$, it suffices to ignore the Lagrangian term and instead ensure that $\partial f(B) \propto \mathbf{1}$.
$f(B)$ is a maximum of $M$ convex functions, where $f_j(B)$ is the function corresponding to hypothesis $v_j$,
and, by direct calculation, the subgradient of this function $f_j(B)$ is:

\[ \frac{\partial f_j(B)}{\partial B_i} = -\frac{1}{\alpha} \sum_{k \ne j} (v_j(i) - v_k(i))^2 \exp\left(-\frac{\|v_j - v_k\|_B^2}{\alpha}\right). \]

Moreover, the subgradient of the maximum of a set of functions is the convex hull of the subgradients of
all functions achieving the maximum. This means that if there exists a distribution $\pi$, supported over the
maximizers of $f(B)$, such that the expectation of the subgradients is constant, we have certified optimality
of $B$. This is precisely the condition in the Proposition. □
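
The certificate can be checked numerically on a small family. The sketch below assumes $\|x\|_B^2 = \sum_i B(i) x(i)^2$ and tests only the special case where $\pi$ is uniform over all hypotheses, so it returns `False` whenever that particular certificate fails, which does not by itself show that $B$ is suboptimal. The function names and the small $k$-sets instance are hypothetical.

```python
import math
from itertools import combinations

def weighted_sq_dist(u, v, B):
    """||u - v||_B^2 = sum_i B(i) (u(i) - v(i))^2 (assumed form of the norm)."""
    return sum(b * (a - c) ** 2 for a, c, b in zip(u, v, B))

def f_j(V, j, B, alpha):
    """f_j(B) = sum_{k != j} exp(-||v_j - v_k||_B^2 / alpha)."""
    return sum(math.exp(-weighted_sq_dist(V[j], vk, B) / alpha)
               for k, vk in enumerate(V) if k != j)

def grad_f_j(V, j, B, alpha):
    """Gradient of f_j with respect to B, obtained by direct differentiation."""
    grad = [0.0] * len(B)
    for k, vk in enumerate(V):
        if k == j:
            continue
        weight = math.exp(-weighted_sq_dist(V[j], vk, B) / alpha) / alpha
        for i, (a, c) in enumerate(zip(V[j], vk)):
            grad[i] -= (a - c) ** 2 * weight
    return grad

def uniform_prior_certifies(V, B, alpha, tol=1e-9):
    """Check the certificate with pi uniform over all hypotheses."""
    M = len(V)
    values = [f_j(V, j, B, alpha) for j in range(M)]
    if max(values) - min(values) > tol:
        return False  # not every hypothesis attains the maximum
    grads = [grad_f_j(V, j, B, alpha) for j in range(M)]
    mean_grad = [sum(g[i] for g in grads) / M for i in range(len(B))]
    return max(mean_grad) - min(mean_grad) <= tol

# Small k-sets instance: all subsets of size k = 2 in dimension d = 4.
d, k, mu, tau, alpha = 4, 2, 1.0, 4.0, 1.0
V = [[mu if i in S else 0.0 for i in range(d)] for S in combinations(range(d), k)]
uniform_ok = uniform_prior_certifies(V, [tau / d] * d, alpha)
```

For this symmetric family the uniform allocation passes the check, matching the symmetry argument used for $k$-sets in Appendix A.5.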


A.4 Proof of Theorem in 

Our approach is based on a well-known connection between the minimax risk and the Bayes risk. For a
structured normal means problem defined by a family $\mathcal{V}$, the Bayes risk for an estimator $T$ under prior
$\pi \in \Delta_{M-1}$ is given by:

\[ B_\pi(T) = \sum_{j=1}^{M} \pi_j P_j[T(y) \ne j]. \]

We say that an estimator $T$ is the Bayes estimator for prior $\pi$ if it achieves the minimum Bayes risk. A simple
calculation reveals the structure of the Bayes estimator for any prior $\pi$, and this structural characterization is
essential to our development.

Proposition 12. For any prior $\pi$, the Bayes estimator has polyhedral acceptance regions, that is, the
estimator is of the form:

\[ T(y) = j \text{ if } y \in A_j, \]

with $A_j = \{x : \Gamma_j x \ge b_j\}$, where the matrix $\Gamma_j$ has $v_j - v_k$ in the $k$th row and $b_j$ has $\tfrac{1}{2}(\|v_j\|_2^2 - \|v_k\|_2^2) + \log \frac{\pi_k}{\pi_j}$
in the $k$th entry. These polyhedral sets $A_j$ partition the space $\mathbb{R}^d$.

Proof. To prove Proposition 12 we make two claims. First we certify that for a prior $\pi$, the Maximum a
Posteriori (MAP) estimator is a Bayes estimator for prior $\pi$. Given $\pi$, the MAP estimator is:

\[ T_{MAP}(y) = \operatorname{argmax}_j \pi(j) \exp\{-\|v_j - y\|_2^2/2\} \]

Define the posterior risk of an estimator $T$ to be the expectation of the loss under the posterior distribution
on the hypotheses. In our case this is:

\[ r(T|y) = \sum_{j=1}^{M} \mathbf{1}[T(y) \ne j] \pi(j|y) \quad \text{where} \quad \pi(j|y) \propto \pi(j) \exp\{-\|v_j - y\|_2^2/2\}. \]

For a fixed $y$, this quantity is minimized by letting $T(y)$ be the maximizer of the posterior, as this makes the
0-1 loss term zero for the largest $\pi(j|y)$ value. Thus for each $y$ we minimize the posterior risk by letting
$T(y)$ be the MAP estimate. The result follows by the well known fact that if an estimator minimizes the
posterior risk at each point, then it is the Bayes estimator.

This argument shows that the only types of estimators we need to analyze are MAP estimators under
various priors. This gives us the requisite structure to prove Proposition 12.

Specifically, for a prior $\pi$, for the MAP estimate to predict hypothesis $j$, it must be the case that:

\[ \forall k \ne j.\ \pi_j \exp\{-\|v_j - y\|_2^2/2\} \ge \pi_k \exp\{-\|v_k - y\|_2^2/2\}. \]

Taking logarithms and expanding the squares, this can be simplified to:

\[ \langle v_j - v_k, y \rangle \ge \tfrac{1}{2}(\|v_j\|_2^2 - \|v_k\|_2^2) + \log \frac{\pi_k}{\pi_j}. \]

Thus the acceptance region for the hypothesis $j$ is the set of all points $y$ that satisfy all of these $M - 1$
inequalities. This is exactly the polyhedral set $A_j$. □
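
The equivalence between the MAP rule and membership in the polyhedral set $A_j$ can be checked on a small example. The sketch below is illustrative only; the function names, the three hypotheses, and the prior are assumptions.

```python
import math

def dot(u, v):
    """Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))

def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def map_estimate(V, prior, y):
    """argmax_j pi(j) exp(-||v_j - y||^2 / 2), computed in log space."""
    return max(range(len(V)),
               key=lambda j: math.log(prior[j]) - 0.5 * sq_dist(V[j], y))

def in_acceptance_region(V, prior, j, y):
    """Check <v_j - v_k, y> >= (||v_j||^2 - ||v_k||^2)/2 + log(pi_k/pi_j) for all k != j."""
    for k, vk in enumerate(V):
        if k == j:
            continue
        lhs = dot([a - b for a, b in zip(V[j], vk)], y)
        rhs = 0.5 * (dot(V[j], V[j]) - dot(vk, vk)) + math.log(prior[k] / prior[j])
        if lhs < rhs:
            return False
    return True

# Hypothetical three-hypothesis family in the plane with a non-uniform prior.
V = [[0.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
prior = [0.5, 0.3, 0.2]
grid = [[0.5 * a, 0.5 * b] for a in range(-4, 9) for b in range(-4, 9)]
```

On every grid point, exactly one region contains the point, and it is the region of the MAP estimate, which is the partition property stated in Proposition 12 (away from boundaries).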

We also exploit the relationship between the minimax risk and the Bayes risk. This is a well known
result, where the prior $\pi$ below is known as the least-favorable prior.

Proposition 13. Suppose that $T$ is a Bayes estimator for some prior $\pi$. If the risk $\mathcal{R}_j(T) = \mathcal{R}_{j'}(T)$ for all
$j \ne j' \in [M]$, then $T$ is a minimax optimal estimator.

Proof. We provide a proof of this well-known result showing that the Bayes estimator with uniform risk
landscape is minimax optimal. Let $T_\pi$ be the Bayes estimator under prior $\pi$ and let $T_0$ be some other
estimator. Since $T_\pi$ has constant risk landscape, we know that $\max_j \mathcal{R}_j(\mathcal{V}, T_\pi) = B_\pi(T_\pi)$, that is, the maximum
risk of $T_\pi$ is equal to its Bayes risk. We know that the Bayes risk of $T_0$ is at most the maximum risk of $T_0$,
i.e. $B_\pi(T_0) \le \max_j \mathcal{R}_j(\mathcal{V}, T_0)$. If it were the case that $T_0$ had strictly lower maximum risk, then we would have:

\[ B_\pi(T_0) \le \max_j \mathcal{R}_j(\mathcal{V}, T_0) < \max_j \mathcal{R}_j(\mathcal{V}, T_\pi) = B_\pi(T_\pi) \]

However, this is a contradiction since $T_\pi$ is the Bayes estimator under prior $\pi$, meaning that it minimizes the
Bayes risk. □

Equipped with these results, we now turn to the proof of the theorem. 

Proof of the theorem. Our goal is to apply Proposition 13. By the fact that $\mathcal{R}_j(\mathcal{V}, T) = 1 - P_j[A_j]$,
where $A_j$ is $T$'s acceptance region for hypothesis $j$, we must show that the $P_j$ probability content of the
acceptance regions is constant. Ignoring the normalization factor of the Gaussian density, this is:

\[ \int_{A_j} \exp\{-\|v_j - x\|_2^2/2\} dx, \]

where $A_j = \{z \mid \Gamma_j z \ge b_j\}$ as defined in Proposition 12. We will exploit the unitary invariance of the family.

For any pair of hypotheses $j, k$, let $R_{jk}$ be the orthogonal matrix such that $v_k = R_{jk} v_j$ and note that
$R_{kj}$, the orthogonal matrix that maps $v_k$ to $v_j$, is just $R_{jk}^T$. This also means that $R_{jk} R_{jk}^T = R_{jk} R_{kj} = I$.
Via a change of variables $x = R_{kj} y$, the integrand becomes:

\[ \exp\{-\|v_j - R_{kj} y\|_2^2/2\} = \exp\{-\|R_{jk} v_j - R_{jk} R_{kj} y\|_2^2/2\} = \exp\{-\|v_k - y\|_2^2/2\}. \]

Thus, we have translated to the $P_k$ measure.

As for the region of integration, first note that since $v_i = R_{ji} v_j$, it must be the case that $\|v_j\|_2^2 = \|v_i\|_2^2$
for all $j, i \in [M]$. This means that the vector $b_j$ defining the acceptance region, which for the MLE has
coordinates $b_j(i) = \tfrac{1}{2}(\|v_j\|_2^2 - \|v_i\|_2^2)$, is just the all-zeros vector. The region of integration is therefore
$\{x \mid \Gamma_j x \ge 0\}$, which in the new variable becomes:

\[ \{y \mid \Gamma_j R_{kj} y \ge 0\}. \]

We must check that this polytope is exactly $A_k$, which means that we must check that for each $i$,
$(v_j - v_i)^T R_{kj}$ is a row of the $\Gamma_k$ matrix. But:

\[ (v_j - v_i)^T R_{kj} = (R_{jk} v_j)^T - (R_{jk} v_i)^T = v_k^T - (R_{jk} v_i)^T. \]

Since $v_i$ can generate the family $\mathcal{V}$, it must be the case that $R_{jk} v_i \in \mathcal{V}$ so that this difference does correspond
to some row of $\Gamma_k$. Since we apply the same unitary operator to all of the rows, it must be the case that the
number of distinct rows is unchanged, or in other words, there is a bijection from the rows in $\Gamma_j R_{kj}$ to
the rows in $\Gamma_k$. Therefore, the transformed region of integration, after the change of variable $x = R_{kj} y$, is
exactly the acceptance region $A_k$, and the integrand is the $P_k$ measure. This means that $P_k[A_k] = P_j[A_j]$,
and this is true for all pairs $(j, k)$, so that the risk landscape is constant. By Proposition 13, this certifies
optimality of the MLE. □

A.5 Calculations for the examples 

Calculations for $k$-Sets: We must upper and lower bound $W(\mathcal{V}, \alpha)$. First note that by symmetry, every
hypothesis achieves the maximum, so it suffices to compute just one of them:

\[ W(\mathcal{V}, \alpha) = \sum_{j' \ne j} \exp(-\|v_{j'} - v_j\|_2^2/\alpha) = \sum_{s=1}^{k} \binom{k}{s} \binom{d-k}{s} \exp(-2 s \mu^2/\alpha). \]

This follows by noting that the squared distance between two hypotheses is $\mu^2$ times the size of the symmetric
difference between the two subsets, and then by a simple counting argument. Using well known bounds on binomial
coefficients, we obtain:

\[ W(\mathcal{V}, \alpha) \le \sum_{s=1}^{k} \exp(s \log(ke/s) + s \log((d-k)e/s) - 2 s \mu^2/\alpha) \]
\[ = \sum_{s=1}^{k} \exp(s \log(e^2 k(d-k)/s^2) - 2 s \mu^2/\alpha) \]
\[ \le k \exp(\log(e^2 k(d-k)) - 2\mu^2/\alpha) \quad \text{if } 2\mu^2/\alpha \ge \log(e^2 k(d-k)). \]

This is smaller than $\delta$ whenever $\mu^2 \ge \alpha \log(e k(d-k)/\delta)$, which subsumes the requirement above. For the
lower bound:

\[ W(\mathcal{V}, \alpha) \ge \sum_{s=1}^{k} \exp(s \log(k/s) + s \log((d-k)/s) - 2 s \mu^2/\alpha) \ge \exp(-2\mu^2/\alpha + \log(k(d-k))) \]

which goes to infinity if $\mu^2 = o(\alpha \log(k(d-k)))$.

To certify that the uniform allocation strategy minimizes $W(\mathcal{V}, \alpha, B)$, we apply the optimality certificate
proved in Appendix A.3. Fix $\tau$ and let $B$ be such that $B(i) = \tau/d$. By symmetry, every hypothesis achieves
the maximum under this allocation strategy, and we will take $\pi$ to be the uniform distribution over all hypotheses.
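
The counting argument behind the closed form for $W(\mathcal{V}, \alpha)$ in the $k$-sets calculation can be sanity-checked by brute force on small instances. In the sketch below, the function names are hypothetical and the hypotheses are taken to be the vectors $\mu \mathbf{1}_S$ for subsets $S$ of size $k$.

```python
import math
from itertools import combinations

def ksets_family(d, k, mu):
    """All vectors mu * 1_S for subsets S of [d] with |S| = k."""
    return [[mu if i in S else 0.0 for i in range(d)]
            for S in combinations(range(d), k)]

def sedf_brute_force(V, alpha):
    """W(V, alpha) = max_j sum_{k != j} exp(-||v_j - v_k||^2 / alpha)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return max(
        sum(math.exp(-sq_dist(vj, vk) / alpha) for i, vk in enumerate(V) if i != j)
        for j, vj in enumerate(V)
    )

def sedf_ksets_closed_form(d, k, mu, alpha):
    """sum_{s=1}^{k} C(k, s) C(d-k, s) exp(-2 s mu^2 / alpha)."""
    return sum(
        math.comb(k, s) * math.comb(d - k, s) * math.exp(-2 * s * mu ** 2 / alpha)
        for s in range(1, k + 1)
    )

d, k, mu, alpha = 6, 2, 1.3, 8.0
brute = sedf_brute_force(ksets_family(d, k, mu), alpha)
closed = sedf_ksets_closed_form(d, k, mu, alpha)
```

Here `math.comb(d - k, s)` is zero when $s > d - k$, so summing up to $k$ is harmless even when $k > d - k$.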
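For a hypothesis $j$ and a coordinate $i$, the subgradient at $B(i)$ depends on whether $v_j(i) = 0$ or
not. If $v_j(i) = 0$, then:

\[ \frac{\partial f_j(B)}{\partial B(i)} = -\frac{\mu^2}{\alpha} \sum_{s=1}^{k} \binom{k}{s} \binom{d-k-1}{s-1} \exp\left(-\frac{2 s \tau \mu^2}{\alpha d}\right), \]

and if $v_j(i) = \mu$ then:

\[ \frac{\partial f_j(B)}{\partial B(i)} = -\frac{\mu^2}{\alpha} \sum_{s=1}^{k} \binom{k-1}{s-1} \binom{d-k}{s} \exp\left(-\frac{2 s \tau \mu^2}{\alpha d}\right). \]

Both of these follow from straightforward counting arguments: only hypotheses that disagree with $j$ at coordinate $i$
contribute, and they are grouped by the number $s$ of coordinates swapped. Notice that the value of the subgradient
depends only on whether $v_j(i) = 0$ or not, and under the uniform distribution $\pi$, $E_{j \sim \pi} v_j(i) = E_{j \sim \pi} v_j(i')$.
This implies that the constant vector is in the subgradient of $f(B)$ at $B$, so that $B$ is the minimizer of
$W(\mathcal{V}, \alpha, B)$ subject to $\|B\|_1 \le \tau$.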

We have already done the requisite calculation to bound the minimax risk under sampling. The calculations
above show that if $\mu = \omega\left(\sqrt{\tfrac{d}{\tau} \log(k(d-k))}\right)$ then the maximum likelihood estimator, when using
the uniform sampling strategy, has risk tending to zero. Conversely, if $\mu = o\left(\sqrt{\tfrac{d}{\tau} \log(k(d-k))}\right)$ then the
minimax risk, for any allocation strategy, tends to one.

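Calculation for Biclusters: Due to symmetry, all hypotheses achieve the maximum and therefore, we
can directly calculate $W(\mathcal{V}, \alpha)$. We use the notation ${}_nC_k$ to denote the binomial coefficient $\binom{n}{k}$.

\[ W(\mathcal{V}, \alpha) = \sum_{s_r=1}^{k} \sum_{s_c=1}^{k} {}_kC_{s_r}\, {}_{d-k}C_{s_r}\, {}_kC_{s_c}\, {}_{d-k}C_{s_c} \exp\left(-\frac{2\mu^2}{\alpha}\left(s_r(k - s_c) + s_c(k - s_r) + s_r s_c\right)\right) \]
\[ \qquad + \sum_{s_r=1}^{k} {}_kC_{s_r}\, {}_{d-k}C_{s_r} \exp\left(-\frac{2\mu^2}{\alpha} s_r k\right) + \sum_{s_c=1}^{k} {}_kC_{s_c}\, {}_{d-k}C_{s_c} \exp\left(-\frac{2\mu^2}{\alpha} s_c k\right) \]

The last two terms come from the cases where $s_c = 0$ or $s_r = 0$, which cover all of the hypotheses that share
the same columns but disagree on the rows (or share the same rows but disagree on the columns). Using
binomial approximations, the first term can be upper bounded by:

\[ \sum_{s_r=1}^{k} \sum_{s_c=1}^{k} \exp\left(s_r \log\frac{k(d-k)e^2}{s_r^2} + s_c \log\frac{k(d-k)e^2}{s_c^2} - \frac{2\mu^2}{\alpha}\left(s_r(k - s_c/2) + s_c(k - s_r/2)\right)\right) \]
\[ \le \left(\sum_{s_r=1}^{k} \exp\left(s_r\left(\log\frac{k(d-k)e^2}{s_r^2} - \frac{k\mu^2}{\alpha}\right)\right)\right) \left(\sum_{s_c=1}^{k} \exp\left(s_c\left(\log\frac{k(d-k)e^2}{s_c^2} - \frac{k\mu^2}{\alpha}\right)\right)\right) \]

where the second step uses $k - s_c/2 \ge k/2$ and $k - s_r/2 \ge k/2$. The two sums here are identical, so we will just bound the first one:

\[ \sum_{s_r=1}^{k} \exp\left(s_r\left(\log\frac{k(d-k)e^2}{s_r^2} - \frac{k\mu^2}{\alpha}\right)\right) \le \sum_{s_r=1}^{k} \exp\left(s_r\left(\log(k(d-k)e^2) - k\mu^2/\alpha\right)\right) \]
\[ \le k \exp\left(\log(k(d-k)e^2) - k\mu^2/\alpha\right) \quad \text{if } k\mu^2/\alpha \ge \log(k(d-k)e^2). \]

Applying this inequality to both sums gives a bound on the first term of $W(\mathcal{V}, \alpha)$. This bound is smaller than $\delta$ as long as
$\mu \ge c\sqrt{\tfrac{\alpha}{k} \log(k(d-k)e/\delta)}$ for some universal constant $c$. Again this subsumes the condition required for
the inequality to hold.

The other two terms are essentially the same. Using binomial approximations, both expressions can be
bounded as:

\[ \sum_{s_r=1}^{k} {}_kC_{s_r}\, {}_{d-k}C_{s_r} \exp\left(-\frac{2\mu^2}{\alpha} s_r k\right) \le \sum_{s_r=1}^{k} \exp\left(s_r \log(e^2 k(d-k)/s_r^2) - 2 s_r k \mu^2/\alpha\right) \]
\[ \le k \exp\left(\log(k(d-k)e^2) - 2k\mu^2/\alpha\right) \quad \text{if } 2k\mu^2/\alpha \ge \log(k(d-k)e^2). \]

These bounds lead to the same minimax rate as above.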

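For the lower bound, we again use binomial approximations.

\[ W(\mathcal{V}, \alpha) \ge \sum_{s_r=1}^{k} \sum_{s_c=1}^{k} \exp\left(s_r \log\frac{k(d-k)}{s_r^2} + s_c \log\frac{k(d-k)}{s_c^2} - \frac{2\mu^2}{\alpha}\left(s_r(k - s_c/2) + s_c(k - s_r/2)\right)\right) \]
\[ \ge \left(\sum_{s_r=1}^{k} \exp\left(s_r\left(\log\frac{k(d-k)}{s_r^2} - \frac{2k\mu^2}{\alpha}\right)\right)\right) \left(\sum_{s_c=1}^{k} \exp\left(s_c\left(\log\frac{k(d-k)}{s_c^2} - \frac{2k\mu^2}{\alpha}\right)\right)\right) \]
\[ \ge \left(\exp\left(\log(k(d-k)) - 2\mu^2 k/\alpha\right)\right)^2 \]

The second step uses $k - s_c/2 \le k$ and $k - s_r/2 \le k$, and the last step keeps only the $s_r = s_c = 1$ terms.
This lower bound goes to infinity if $\mu = o\left(\sqrt{\tfrac{\alpha}{k} \log(k(d-k))}\right)$, which lower bounds the minimax rate.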

To certify that the uniform allocation strategy minimizes $W(\mathcal{V}, \alpha, B)$, we apply the optimality certificate
proved in Appendix A.3. Fix $\tau$ and let $B$ be such that $B((a, b)) = \tau/d^2$ for all $(a, b) \in [d] \times [d]$. By symmetry,
every hypothesis achieves the maximum under this allocation strategy, and we will take $\pi$ to be the uniform
distribution over all hypotheses. For a hypothesis $j$, let $f_j(B)$ denote the term in the SEDF centered around $j$.
For a hypothesis $j$ based on clusters $S_l, S_r$ and a coordinate $(a, b)$, the subgradient at $B(a, b)$ depends on
whether $a \in S_l$ and $b \in S_r$. There are four cases: $a \notin S_l, b \notin S_r$; $a \in S_l, b \notin S_r$; $a \notin S_l, b \in S_r$;
and $a \in S_l, b \in S_r$. In each case, direct calculation gives an expression of the form:

\[ \left.\frac{\partial f_j(B)}{\partial B(a, b)}\right|_{B(a,b) = \tau/d^2} = -\frac{\mu^2}{\alpha} \sum_{s_r} \sum_{s_c} N(s_r, s_c) \exp\left(-\frac{2\tau\mu^2}{\alpha d^2}\left(s_r(k - s_c) + s_c(k - s_r) + s_r s_c\right)\right), \]

where $N(s_r, s_c)$ is a product of binomial coefficients counting the hypotheses that differ from $j$ in $s_r$ rows and
$s_c$ columns and that disagree with $j$ at coordinate $(a, b)$. Both the count and the ranges of the two sums depend
only on which of the four cases applies.

The main point is that the value of the subgradient depends only on presence or absence of the
row/column in the cluster, and under the uniform distribution $\pi$, each row/column is equally likely to be in
the cluster. This means that for every coordinate $(a, b)$, taking the expected subgradient with respect to the
uniform distribution over hypotheses yields the same expression. So the constant vector is in the subgradient
of $f(B)$ at $B$, so that $B$ is the minimizer of $W(\mathcal{V}, \alpha, B)$ subject to $\|B\|_1 \le \tau$.

We have already done the requisite calculation to bound the minimax risk under sampling. The calculations
above show that if $\mu = \omega\left(\sqrt{\tfrac{d^2}{k\tau} \log(k(d-k))}\right)$ then the maximum likelihood estimator, when using
the uniform sampling strategy, has risk tending to zero. Conversely, if $\mu = o\left(\sqrt{\tfrac{d^2}{k\tau} \log(k(d-k))}\right)$ then the
minimax risk, for any allocation strategy, tends to one.

The biclusters family is clearly unitarily invariant with respect to the set of orthonormal matrices that
permute the rows and columns independently. The family is easiest to describe as acting on the matrices
$\mu \mathbf{1}_{S_l} \mathbf{1}_{S_r}^T$. Let $P_l, P_r$ be any two $d \times d$ permutation matrices. Then the matrix $P_l \mathbf{1}_{S_l} \mathbf{1}_{S_r}^T P_r^T$ is clearly another
hypothesis, and as we vary $P_l$ and $P_r$ we generate all of the hypotheses. Note that these permutations are
unitary operators on the matrix space, which allows us to apply the unitary invariance result proved in Appendix A.4.

For the analysis of the interactive algorithm, let us first bound the probability that the algorithm makes a
mistake on any single coordinate. Consider sampling a coordinate $x$ with mean $\mu$ and noise variance $1/b$. A
Gaussian tail bound reveals that:

\[ P[|x - \mu| \ge \epsilon] \le 2\exp(-2 b \epsilon^2). \]

We will sample no more than $d^2$ coordinates and we will sample each coordinate with the same amount of
energy $b$. So by the union bound, the probability that we make a single mistake in classifying a coordinate
that we query is bounded by $\delta/2$ as long as:

\[ \mu \ge 2\epsilon = \sqrt{\tfrac{2}{b} \log(4d^2/\delta)}. \]

We now need to bound $b$, which depends on the total number of coordinates queried by the algorithm. In
the first phase of the algorithm, we sample coordinates uniformly at random until we hit one that is active.
Since each sample hits an active coordinate with probability $k^2/d^2$:

\[ P[\text{hit active coordinate in } T \text{ samples}] = 1 - (1 - k^2/d^2)^T \ge 1 - \exp(-T k^2/d^2), \]

so if $T = \tfrac{d^2}{k^2} \log(2/\delta)$, the probability that we hit an active coordinate in $T$ samples will be at least $1 - \delta/2$.
The total number of samples we use can then be upper bounded by $2d + \tfrac{d^2}{k^2} \log(2/\delta)$, which means that we
can allocate our budget $\tau$ evenly over these coordinates. Therefore we can set $b = \tau\left(2d + \tfrac{d^2}{k^2} \log(2/\delta)\right)^{-1}$,
and plugging into the condition on $\mu$ above proves the result.
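
The sample-count and energy calculations in this argument are easy to evaluate numerically. The following sketch is illustrative only: the function names are hypothetical, and $T$ is rounded up to an integer, which the text does not discuss.

```python
import math

def first_phase_samples(d, k, delta):
    """T = (d^2 / k^2) log(2 / delta), rounded up to an integer."""
    return math.ceil((d * d) / (k * k) * math.log(2.0 / delta))

def hit_probability(d, k, T):
    """Probability that T uniform samples hit at least one of the k^2 active coordinates."""
    return 1.0 - (1.0 - (k * k) / (d * d)) ** T

def per_coordinate_energy(tau, d, k, delta):
    """b = tau / (2d + (d^2 / k^2) log(2 / delta)): the budget spread over all queries."""
    return tau / (2 * d + (d * d) / (k * k) * math.log(2.0 / delta))

d, k, delta, tau = 20, 4, 0.1, 1000.0
T = first_phase_samples(d, k, delta)
p_hit = hit_probability(d, k, T)
b = per_coordinate_energy(tau, d, k, delta)
```

With these example values the first phase succeeds with probability at least $1 - \delta/2$, as the inequality $(1 - p)^T \le \exp(-pT)$ guarantees.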

Calculations for Hierarchical Clustering: We first describe the hierarchical clustering model.

We study the special case of balanced binary hierarchical clustering on $n$ objects which, without loss of
generality, we call $[n] = \{1, \ldots, n\}$. A binary hierarchical clustering is a collection $\mathcal{C}$ of subsets of $[n]$, such
that $[n] \in \mathcal{C}$, and for each $C_\xi \in \mathcal{C}$, if $|C_\xi| > m$, then there exist two sets $C_{\xi \circ L}, C_{\xi \circ R} \in \mathcal{C}$, both of size $|C_\xi|/2$, that
partition $C_\xi$. As a naming convention, we identify a cluster by a string $\xi$ of $L$ and $R$ symbols. The two
sub-clusters of a non-terminal cluster $C_\xi$ are $C_{\xi \circ L}$ and $C_{\xi \circ R}$. The noisy Constant Block Model is defined
using this terminology as follows.

Definition 14. A similarity matrix $W$ is a noisy constant block matrix (noisy CBM) if $W = A + R$
where $A$ is ideal and $R$ is a perturbation matrix:

• An ideal similarity matrix is characterized by off-block diagonal similarity values $\beta_\xi \in [0, 1]$ for
each cluster $C_\xi$ such that if $x \in C_{\xi \circ L}$ and $y \in C_{\xi \circ R}$, where $C_{\xi \circ L}$ and $C_{\xi \circ R}$ are two sub-clusters
of $C_\xi$ at the next level in a binary hierarchy, then $A_{x,y} = \beta_\xi$. Additionally, $\min\{\beta_{\xi \circ R}, \beta_{\xi \circ L}\} \ge \beta_\xi$.
Define $\mu = \min\{\min_\xi\{\min\{\beta_{\xi \circ R}, \beta_{\xi \circ L}\} - \beta_\xi\}, \beta_0\}$, where $\beta_0$ is the minimum overall similarity.

• A symmetric $(n \times n)$ matrix $R$ is a perturbation matrix with parameter $\sigma$ if (a) $E(R_{ij}) = 0$, (b) the
entries of $R$ are subgaussian, that is $E(\exp(t R_{ij})) \le \exp(\sigma^2 t^2/2)$, and (c) for each row $i$, $R_{i1}, \ldots, R_{in}$
are independent.

We focus on a subfamily of this model, parameterized by $n, m, \mu$ where both $n$ and $m$ are powers of two.
Our subfamily $\mathcal{H}$ consists of all perfectly balanced hierarchical clusterings on $n$ objects with minimum cluster
size $m$ and where all similarities are an integral multiple of $\mu$. This is the simple hierarchical clustering
model specified in the main text. Note that this set of models is a subset of the noisy CBM, so a lower bound for
this family applies to the noisy CBM. Let $\mathcal{V}$ denote the class of all such matrices, parameterized by the number
of objects $n$, minimum cluster size $m$, and signal strength $\mu$. We interpret $\mathcal{V}$ as a collection of vectors defined
by $v_C = \mathrm{vec}(M[C])$ for each perfectly balanced hierarchical clustering $C$.

We now prove the corollary. For the first claim, we must lower bound the quantity $W(\mathcal{V}, \alpha)$:

\[ W(\mathcal{V}, \alpha) = \max_{C} \sum_{C' \ne C} \exp\left(-\|v_C - v_{C'}\|_2^2/\alpha\right) \]

Let $C_0$ be one of these models. Consider perturbing $C_0$ by taking an object and swapping that object with
another one in the adjacent cluster at the deepest level of the hierarchy. There are $nm/2$ such perturbations
and any perturbation $C$ has $\|v_C - v_{C_0}\|_2^2 = \mu^2(8m - 4)$. This gives the lower bound of:

\[ W(\mathcal{V}, \alpha) \ge \frac{nm}{2} \exp\left(-\frac{\mu^2}{\alpha}(8m - 4)\right) \]

By the lower bound proved in Appendix A.1 with $\delta = 1/2$, if $W(\mathcal{V}, 1) \ge 3$, then the minimax risk is bounded
from below by 1/2. Applying our lower bound and solving for $\mu$ proves the first part of the result.

For the second claim, if we certify that the uniform sampling strategy minimizes $W(\mathcal{H}, \alpha, B)$ under the
budget constraint, then we can immediately apply the experimental design bound of Appendix A.2. We will use
the optimality certificate of Appendix A.3 to achieve this.

It is easy to see that for the class $\mathcal{V}$, when $B$ is uniform, every one of these hypotheses achieves the maximum
in the definition of $W(\mathcal{V}, \alpha, B)$. Moreover, notice that for every pair of pairs of objects $\{a, b\}, \{a', b'\}$,
there is a bijection $\rho$ over $\mathcal{V}$ based on swapping $a$ with $a'$ and $b$ with $b'$ in the hierarchy such that for any
hypothesis $v_C$, we have $v_C(a, b) = v_{\rho(C)}(a', b')$. If in $C$, $a$ and $b$ are clustered at some level $l$, then by swapping
$a$ with $a'$ and $b$ with $b'$ to form $\rho(C)$, $a'$ and $b'$ are clustered at level $l$ in $\rho(C)$, so both terms will be identical
because we are in a constant block model.

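Since $\rho$ is a bijection, when we take $\pi$ to be uniform over the hypotheses, we have:

\[ E_{C \sim \pi} \sum_{C' \ne C} (v_C(a, b) - v_{C'}(a, b))^2 \exp(-\|v_{C'} - v_C\|_2^2) \]
\[ = E_{C \sim \pi} \sum_{C' \ne C} (v_{\rho(C)}(a', b') - v_{\rho(C')}(a', b'))^2 \exp(-\|v_{\rho(C')} - v_{\rho(C)}\|_2^2) \]
\[ = E_{C \sim \pi} \sum_{C' \ne C} (v_C(a', b') - v_{C'}(a', b'))^2 \exp(-\|v_{C'} - v_C\|_2^2). \]

This means we may apply the optimality certificate of Appendix A.3, which certifies that the uniform sampling
strategy minimizes the function $W(\mathcal{H}, \alpha, B)$ under the budget constraint.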

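Equipped with this fact, we can reproduce the calculation above under the uniform allocation. Writing $D$ for the
number of coordinates, so that $B_i = \tau/D$, every squared distance is scaled by $\tau/D$, giving:

\[ W(\mathcal{V}, \alpha, B) \ge \frac{nm}{2} \exp\left(-\frac{\tau\mu^2}{\alpha D}(8m - 4)\right) \]

This class is also unitarily invariant using the same tensorized permutation family from the biclustering
example. Therefore the MLE is optimal. □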

Calculation for Stars: For the stars problem, define $\mathrm{Nb}(j) \subset V$ to be the neighbors of the vertex $j$ in
the graph. For a fixed hypothesis $j$, we have

\[ W_j(\mathcal{V}, \alpha) = \sum_{k \ne j} \exp\left(-\|v_j - v_k\|_2^2/\alpha\right) \]
\[ = \sum_{k \in \mathrm{Nb}(j)} \exp\left(-\mu^2(\deg(k) + \deg(j) - 2)/\alpha\right) + \sum_{k \notin \mathrm{Nb}(j),\, k \ne j} \exp\left(-\mu^2(\deg(k) + \deg(j))/\alpha\right) \]
\[ \le \exp\left(-\mu^2 \deg_{\min}/\alpha - \mu^2 \deg(j)/\alpha\right) \left(\deg(j) \exp(2\mu^2/\alpha) + |V| - \deg(j)\right) \]

This last inequality follows by replacing every $\deg(k)$ with $\deg_{\min}$, the lower bound on the degrees. This
last expression is maximized with $\deg(j) = \deg_{\min}$, which can be observed by noticing that the derivative
with respect to $\deg(j)$ is negative. This gives the bound:

\[ W(\mathcal{V}, \alpha) \le \exp\left(-2\mu^2 \deg_{\min}/\alpha\right) \left(\deg_{\min} \exp(2\mu^2/\alpha) + |V| - \deg_{\min}\right) \]

One can lower bound $W(\mathcal{V}, \alpha)$ by choosing the hypothesis $j$ with $\deg(j) = \deg_{\min}$ and then replacing
all other degree terms with $\deg_{\max}$ in the above calculations. This gives:

\[ W(\mathcal{V}, \alpha) \ge \exp\left(-\frac{\mu^2}{\alpha}(\deg_{\min} + \deg_{\max})\right) \left(\deg_{\min} e^{2\mu^2/\alpha} + |V| - \deg_{\min}\right) \]
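
The degree-based expression for $W_j(\mathcal{V}, \alpha)$ can be compared against direct summation on a small graph. The sketch below assumes that the hypothesis for vertex $j$ is the edge-indexed vector with value $\mu$ on every edge incident to $j$ and zero elsewhere, and that the graph is simple; the function names and the example graph are hypothetical.

```python
import math

def stars_family(edges, n, mu):
    """One hypothesis per vertex: mu on each edge incident to that vertex."""
    return [[mu if j in e else 0.0 for e in edges] for j in range(n)]

def w_j_brute_force(V, j, alpha):
    """W_j = sum_{k != j} exp(-||v_j - v_k||^2 / alpha) by direct summation."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return sum(math.exp(-sq_dist(V[j], vk) / alpha)
               for k, vk in enumerate(V) if k != j)

def w_j_degree_formula(edges, n, mu, alpha, j):
    """Same quantity via degrees: adjacent stars share one edge, others share none."""
    deg = [sum(1 for e in edges if v in e) for v in range(n)]
    nbrs = {u if w == j else w for (u, w) in edges if j in (u, w)}
    total = 0.0
    for k in range(n):
        if k == j:
            continue
        shared = 2 if k in nbrs else 0  # the shared edge is counted in both degrees
        total += math.exp(-mu ** 2 * (deg[k] + deg[j] - shared) / alpha)
    return total

# Hypothetical 5-vertex simple graph.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
n, mu, alpha = 5, 1.5, 8.0
V = stars_family(edges, n, mu)
```

Taking the maximum of either function over $j$ gives $W(\mathcal{V}, \alpha)$ for the graph, which can then be compared with the degree-based bounds above.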